Search CORE

66 research outputs found

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Author: Aliferis Constantin F
Statnikov Alexander
Wang Lily
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract Background Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

Author: Aliferis Constantin F.
Statnikov Alexander
Tsamardinos Ioannis
Publication venue: Libertas Academica
Publication date: 01/01/2006
Field of study

Sound data analysis is critical to the success of modern molecular medicine research that involves collection and interpretation of mass-throughput data. The novel nature and high-dimensionality in such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches to overcome them

Directory of Open Access Journals

PubMed Central

Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study

Author: Aliferis Constantin F.
Li Chun
Statnikov Alexander
Publication venue: Public Library of Science
Publication date: 01/09/2007
Field of study

The development of new high-throughput genotyping technologies has allowed fast evaluation of single nucleotide polymorphisms (SNPs) on a genome-wide scale. Several recent genome-wide association studies employing these technologies suggest that panels of SNPs can be a useful tool for predicting cancer susceptibility and discovery of potentially important new disease loci.In the present paper we undertake a careful examination of the relative significance of genetics, environmental factors, and biases of the data analysis protocol that was used in a previously published genome-wide association study. That prior study reported a nearly perfect discrimination of esophageal cancer patients and healthy controls on the basis of only genetic information. On the other hand, our results strongly suggest that SNPs in this dataset are not statistically linked to the phenotype, while several environmental factors and especially family history of esophageal cancer (a proxy to both environmental and genetic factors) have only a modest association with the disease.The main component of the previously claimed strong discriminatory signal is due to several data analysis pitfalls that in combination led to the strongly optimistic results. Such pitfalls are preventable and should be avoided in future studies since they create misleading conclusions and generate many false leads for subsequent research

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

The FAST-AIMS Clinical Mass Spectrometry Analysis System

Author: Aliferis Constantin F.
Fananapazir Nafeh
Statnikov Alexander
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2009
Field of study

Within clinical proteomics, mass spectrometry analysis of biological samples is emerging as an important high-throughput technology, capable of producing powerful diagnostic and prognostic models and identifying important disease biomarkers. As interest in this area grows, and the number of such proteomics datasets continues to increase, the need has developed for efficient, comprehensive, reproducible methods of mass spectrometry data analysis by both experts and nonexperts. We have designed and implemented a stand-alone software system, FAST-AIMS, which seeks to meet this need through automation of data preprocessing, feature selection, classification model generation, and performance estimation. FAST-AIMS is an efficient and user-friendly stand-alone software for predictive analysis of mass spectrometry data. The present resource review paper will describe the features and use of the FAST-AIMS system. The system is freely available for download for noncommercial use

Crossref

Directory of Open Access Journals

PubMed Central

Causal graph-based analysis of genome-wide association data in rheumatoid arthritis

Author: Ai Jizhou
Alekseyenko Alexander V
Aliferis Constantin F
Ding Bo
Lytkin Nikita I
Padyukov Leonid
Statnikov Alexander
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract Background GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease. Results Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype. Conclusions Our experiments demonstrate that application of TIE* captures maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery. Reviewers This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.</p

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Analysis and Computational Dissection of Molecular Signature Multiplicity

Author: A Ploner
Alexander Statnikov
B Hammer
CF Aliferis
CF Aliferis
CF Aliferis
CF Aliferis
CF Aliferis
Constantin F. Aliferis
DL Gold
E Dougherty
F Azuaje
F Wagner
G Balazsi
G Natsoulis
I Guyon
I Tsamardinos
J Pearl
J Pearl
J Peña
J Shawe-Taylor
JP Ioannidis
L Ein-Dor
L Ein-Dor
L Li
LR Grate
M Hollander
P Roepman
RL Somorjai
S Michiels
S Ramaswamy
Scott Markel
SM Weiss
T Chu
TR Golub
TS Furey
X Qiu
Publication venue: Public Library of Science
Publication date: 01/05/2010
Field of study

Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures independent of distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections

Author: A Grishin
A Statnikov
AK Zaas
Alexander Statnikov
CF Aliferis
CF Aliferis
CF Aliferis
CF Wright
Constantin F Aliferis
GY Chen
J Dresios
Jörn-Hendrik Weitkamp
KA Carlson
Lauren McVoy
Nikita I Lytkin
O Kepp
O Ramilo
RJ Schneider
RR Novoa
T Ohman
UM Braga-Neto
VN Vapnik
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background A recent study reported that gene expression profiles from peripheral blood samples of healthy subjects prior to viral inoculation were indistinguishable from profiles of subjects who received viral challenge but remained asymptomatic and uninfected. If true, this implies that the host immune response does not have a molecular signature. Given the high sensitivity of microarray technology, we were intrigued by this result and hypothesize that it was an artifact of data analysis. Findings Using acute respiratory viral challenge microarray data, we developed a molecular signature that for the first time allowed for an accurate differentiation between uninfected subjects prior to viral inoculation and subjects who remained asymptomatic after the viral challenge. Conclusions Our findings suggest that molecular signatures can be used to characterize immune responses to viruses and may improve our understanding of susceptibility to viral infection with possible implications for vaccine development.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A comprehensive evaluation of multicategory classification methods for microbiomic data

Author: Alekseyenko Alexander V
Aliferis Constantin F
Blaser Martin J
Henaff Mikael
Konganti Kranti
Li Zhiguo
Narendra Varun
Pei Zhiheng
Statnikov Alexander
Yang Liying
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

BACKGROUND: Recent advances in next-generation DNA sequencing enable rapid high-throughput quantitation of microbial community composition in human samples, opening up a new field of microbiomics. One of the promises of this field is linking abundances of microbial taxa to phenotypic and physiological states, which can inform development of new diagnostic, personalized medicine, and forensic modalities. Prior research has demonstrated the feasibility of applying machine learning methods to perform body site and subject classification with microbiomic data. However, it is currently unknown which classifiers perform best among the many available alternatives for classification with microbiomic data. RESULTS: In this work, we performed a systematic comparison of 18 major classification methods, 5 feature selection methods, and 2 accuracy metrics using 8 datasets spanning 1,802 human samples and various classification tasks: body site and subject classification and diagnosis. CONCLUSIONS: We found that random forests, support vector machines, kernel ridge regression, and Bayesian logistic regression with Laplace priors are the most effective machine learning techniques for performing accurate classification from these microbiomic data

Springer - Publisher Connector

Texas A&M Repository

PubMed Central